
[DeepSpeed] scale grad for zero-2 #3880

Closed
kashif wants to merge 1 commit into huggingface:main from kashif:deepspeed-grad-acc

Conversation

@kashif
Contributor

@kashif kashif commented Dec 9, 2025

What does this PR do?

This pull request updates the backward method in src/accelerate/accelerator.py to ensure consistent loss scaling across all distributed training backends.

Distributed training consistency:

  • The loss is now always scaled by gradient_accumulation_steps, regardless of the backend, to prevent incorrect accumulation in DeepSpeed ZeRO-2 and similar scenarios. This change makes loss scaling explicit and consistent for all distributed types (see the sketch below).
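
As a point of reference, the following is a minimal, framework-agnostic sketch of the scaling convention this PR makes explicit. The names (train_with_accumulation, grad_accum_steps, and the loop arguments) are illustrative only and not part of Accelerate's API.

def train_with_accumulation(model, optimizer, data_loader, loss_fn, grad_accum_steps):
    # Illustrative training loop, not Accelerate code.
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        loss = loss_fn(model(inputs), targets)
        # Divide before backward so the gradients accumulated over one window
        # equal the gradient of the mean loss over that window.
        (loss / grad_accum_steps).backward()
        if (step + 1) % grad_accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()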

Fixes #3877

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@kashif kashif changed the title from "scale grad" to "[DeepSpeed] scale grad for zero-2" on Dec 9, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

# Note: DeepSpeed does NOT automatically scale loss during backward for all ZeRO stages,
# particularly ZeRO-2 where gradient partitioning can cause incorrect accumulation
# if the loss is not pre-scaled. This ensures consistent behavior across all ZeRO stages.
loss = loss / self.gradient_accumulation_steps
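
For context, here is a simplified sketch of where such a pre-division would sit in a backward wrapper; this is a hypothetical stand-in written for illustration, not the actual Accelerator.backward source.

class SimpleAccelerator:
    # Hypothetical wrapper, not the real Accelerator class.
    def __init__(self, gradient_accumulation_steps: int = 1):
        self.gradient_accumulation_steps = gradient_accumulation_steps

    def backward(self, loss, **kwargs):
        # The change under review: always pre-divide, for every backend,
        # before handing the loss to the framework's backward pass.
        loss = loss / self.gradient_accumulation_steps
        loss.backward(**kwargs)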
Contributor


Are you sure the issue isn't somewhere else?

You can see this is already done here:
https://github.com/deepspeedai/DeepSpeed/blob/b00b75f05852e0791f1e2b9c1cc894cd690e2da4/deepspeed/runtime/engine.py#L2482

and grads are scaled here:
https://github.com/deepspeedai/DeepSpeed/blob/b00b75f05852e0791f1e2b9c1cc894cd690e2da4/deepspeed/runtime/engine.py#L2360

so I suspect the above change is likely to break things, no?

cc: @tjruwase to double check.
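
To make the concern concrete: if Accelerate pre-divides the loss and the DeepSpeed engine also scales with respect to gradient accumulation steps, the effective gradients shrink by an extra factor of 1/steps. A toy calculation (illustrative arithmetic, not library code):

grad_accum_steps = 8
raw_loss = 2.0

pre_scaled = raw_loss / grad_accum_steps        # division proposed in this PR: 0.25
double_scaled = pre_scaled / grad_accum_steps   # engine-side scaling on top: 0.03125

# The result is the loss divided by grad_accum_steps a second time,
# i.e. an unintended extra factor of 1/grad_accum_steps on every gradient.
assert double_scaled == raw_loss / grad_accum_steps ** 2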

Contributor Author


yes agree! i just wanted to test things... reverting

@kashif
Contributor Author

kashif commented Dec 11, 2025

closing this

@kashif kashif closed this Dec 11, 2025


Development

Successfully merging this pull request may close these issues.

Gradient accumulation gives worse results when using DeepSpeed ZeRO 2
